

Search for: All records

Creators/Authors contains: "Meila, Marina"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full-text articles may not yet be available free of charge during the embargo period (an administrative interval).

Some links on this page may take you to non-federal websites, whose policies may differ from those of this site.

  1. Abstract. Background: Existing software for comparing species delimitation models does not provide a true metric or distance function between species delimitation models, nor a way to compare these models in terms of relative clustering differences along a lattice of partitions. Results: Piikun is a Python package for analyzing and visualizing species delimitation models in an information-theoretic framework that, in addition to classic measures of information such as entropy and mutual information [1], provides for the calculation of the Variation of Information (VI) criterion [2], a true metric or distance function for species delimitation models that is aligned with the lattice of partitions. Conclusions: Piikun is available under the MIT license from its public repository (https://github.com/jeetsukumaran/piikun) and can be installed locally using the Python package manager pip.
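    Per the abstract, the package is installed with the Python package manager pip (presumably as "pip install piikun"). Its own function names are not given in this listing, so the following is only a minimal, self-contained NumPy sketch of the Variation of Information metric that the package computes, evaluated on two hypothetical species delimitation models (the label vectors model_1 and model_2 are invented for illustration, not taken from the paper).

        import numpy as np

        def variation_of_information(labels_a, labels_b):
            # VI(A, B) = H(A) + H(B) - 2 I(A; B): a true metric on partitions.
            labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
            n = labels_a.size
            blocks_a = sorted(set(labels_a.tolist()))
            blocks_b = sorted(set(labels_b.tolist()))
            a_idx = {c: i for i, c in enumerate(blocks_a)}
            b_idx = {c: i for i, c in enumerate(blocks_b)}
            # Joint distribution over (species in model A, species in model B) memberships.
            joint = np.zeros((len(blocks_a), len(blocks_b)))
            for a, b in zip(labels_a, labels_b):
                joint[a_idx[a], b_idx[b]] += 1.0 / n
            pa, pb = joint.sum(axis=1), joint.sum(axis=0)
            entropy = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
            mask = joint > 0
            mutual_info = np.sum(joint[mask] * np.log(joint[mask] / np.outer(pa, pb)[mask]))
            return entropy(pa) + entropy(pb) - 2.0 * mutual_info

        # Two hypothetical delimitation models assigning the same five lineages to species.
        model_1 = ["sp1", "sp1", "sp2", "sp2", "sp3"]   # three species
        model_2 = ["spA", "spA", "spA", "spB", "spB"]   # two species
        print(variation_of_information(model_1, model_2))  # 0.0 only if the models agree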
  2. M. Ranzato; A. Beygelzimer; Y. Dauphin; P.S. Liang; J. Wortman Vaughan (Eds.)
    The null space of the k-th order Laplacian L_k, known as the k-th homology vector space, encodes the non-trivial topology of a manifold or a network. Understanding the structure of the homology embedding can thus disclose geometric or topological information from the data. The study of the null space embedding of the graph Laplacian L_0 has spurred new research and applications, such as spectral clustering algorithms with theoretical guarantees and estimators of the Stochastic Block Model. In this work, we investigate the geometry of the k-th homology embedding and focus on cases reminiscent of spectral clustering. Namely, we analyze the connected sum of manifolds as a perturbation to the direct sum of their homology embeddings. We propose an algorithm to factorize the homology embedding into subspaces corresponding to a manifold's simplest topological components. The proposed framework is applied to the shortest homologous loop detection problem, a problem known to be NP-hard in general. Our spectral loop detection algorithm scales better than existing methods and is effective on diverse data such as point clouds and images.
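    The paper concerns the general k-th order Laplacian; the sketch below (not the authors' code, and using an invented toy data set) illustrates only the familiar k = 0 case it builds on: for a neighborhood graph with two connected components the null space of the graph Laplacian L_0 is two-dimensional, and clustering the rows of that null-space embedding recovers the components, the simplest instance of factorizing a homology embedding into topological components.

        import numpy as np
        from scipy.sparse.csgraph import laplacian
        from sklearn.cluster import KMeans

        # Two well-separated point clouds: the neighborhood graph has two
        # connected components, so the null space of L_0 is 2-dimensional.
        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(5.0, 0.3, (30, 2))])

        # Epsilon-neighborhood graph: edge between i and j iff squared distance < 1.
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = (d2 < 1.0).astype(float)
        np.fill_diagonal(W, 0.0)

        L0 = laplacian(W)                     # 0-th order (graph) Laplacian
        evals, evecs = np.linalg.eigh(L0)
        null_dim = int((evals < 1e-8).sum())  # dimension of the 0-th homology vector space
        H0 = evecs[:, :null_dim]              # null-space ("homology") embedding of the vertices

        # Rows of H0 collapse onto a few distinct points; clustering them
        # recovers the connected components.
        labels = KMeans(n_clusters=null_dim, n_init=10, random_state=0).fit_predict(H0)
        print("null-space dimension:", null_dim)
        print("recovered component sizes:", np.bincount(labels))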
  3. Many manifold embedding algorithms appear to fail when the data manifold has a large aspect ratio (such as a long, thin strip). Here, we formulate success and failure in terms of finding a smooth embedding, and show that the problem is pervasive and more complex than previously recognized. Mathematically, success is possible under very broad conditions, provided that the embedding is done by carefully selected eigenfunctions of the Laplace-Beltrami operator Δ_M. Hence, we propose a bicriterial Independent Eigencoordinate Selection (IES) algorithm that selects smooth embeddings with few eigenvectors. The algorithm is grounded in theory, has low computational overhead, and is successful on synthetic and large real data.
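    The IES criterion itself is not reproduced here; the rough sketch below (with invented data and a crude polynomial-regression diagnostic in place of the paper's criterion) only illustrates the failure mode the abstract describes: on a long, thin strip the leading Laplacian-eigenmap coordinates are all functions of the long direction, so taking the first two eigenvectors by default yields a curve rather than a smooth 2-D embedding, and some later eigenvector must be selected instead.

        import numpy as np
        from sklearn.manifold import SpectralEmbedding

        # A long, thin strip (aspect ratio about 6:1), the failure case described above.
        rng = np.random.default_rng(0)
        n = 2000
        X = np.column_stack([rng.uniform(0.0, 6.0, n), rng.uniform(0.0, 1.0, n)])

        # Several Laplacian-eigenmap coordinates, more than the intrinsic dimension 2.
        emb = SpectralEmbedding(n_components=6, n_neighbors=20, random_state=0).fit_transform(X)

        # Crude diagnostic (not the IES criterion): how much of each eigencoordinate
        # is explained by the long direction x alone?  Coordinates with R^2 close to 1
        # are harmonics of x and carry no information about the strip's width.
        for j in range(emb.shape[1]):
            coeffs = np.polyfit(X[:, 0], emb[:, j], deg=6)
            resid = emb[:, j] - np.polyval(coeffs, X[:, 0])
            r2 = 1.0 - resid.var() / emb[:, j].var()
            print(f"eigencoordinate {j}: R^2 explained by x alone = {r2:.2f}")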
  4. We introduce the Sublevel Set (SS) method, a generic method to obtain sufficient guarantees of near-optimality and uniqueness (up to small perturbations) for a clustering. This method can be instantiated for a variety of clustering loss functions for which convex relaxations exist. Obtaining the guarantees in practice amounts to solving a convex optimization problem. We demonstrate the applicability of this method by obtaining distribution-free guarantees for K-means clustering on realistic data sets.
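    The SS method itself is not spelled out in the abstract; as one illustration of the kind of convex certificate it refers to, the sketch below uses the well-known Peng-Wei SDP relaxation of K-means (an assumed stand-in, not necessarily the relaxation instantiated in the paper): the gap between the loss attained by Lloyd's algorithm and the SDP lower bound certifies, on a toy data set, how far the clustering can be from optimal.

        import numpy as np
        import cvxpy as cp
        from sklearn.cluster import KMeans

        # Toy data: three well-separated Gaussian blobs.
        rng = np.random.default_rng(0)
        centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
        X = np.vstack([rng.normal(c, 0.3, (15, 2)) for c in centers])
        n, K = X.shape[0], 3

        # K-means loss attained by Lloyd's algorithm.
        km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
        loss_km = km.inertia_

        # Peng-Wei SDP relaxation of K-means: its optimum is a convex lower bound
        # on the best achievable K-means loss, so (loss_km - lower_bound) bounds
        # the suboptimality of the clustering found above.
        D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
        Z = cp.Variable((n, n), symmetric=True)
        constraints = [Z >> 0, Z >= 0, cp.sum(Z, axis=1) == 1, cp.trace(Z) == K]
        objective = 0.5 * cp.sum(cp.multiply(D, Z))          # equals 0.5 * trace(D @ Z)
        lower_bound = cp.Problem(cp.Minimize(objective), constraints).solve()

        print(f"K-means loss attained : {loss_km:.3f}")
        print(f"convex lower bound    : {lower_bound:.3f}")
        print(f"suboptimality gap    <= {loss_km - lower_bound:.3f}")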